[FEA] Generate updated supported CSV files from plugin repo #847
Conversation
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang !
I have a couple of questions:
- Do the dependencies for this script (pandas) have to be specified?
- What would be the function of the Jenkins job, i.e., will this Python script update the supported*.csv files and create a new PR?
NA = 2
PS = 3
S = 4
CO = 5
CO is "configured off", correct? So this takes precedence over S if there is an Exec with CO in one version and S in another Spark version?
@nartal1 Thanks! I need to confirm what CO means. The current script assumes CO takes the highest precedence. I can change this once I get an answer.
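The precedence-based union being discussed could be sketched as follows; the numeric values mirror the snippet under review, and the helper name is illustrative, not part of the actual script:

```python
# Precedence values mirroring the snippet under review; the higher value wins
# when the same operator has different support levels across Spark versions.
PRECEDENCE = {"NA": 2, "PS": 3, "S": 4, "CO": 5}

def unify_levels(levels):
    """Pick the support level with the highest precedence (illustrative helper)."""
    return max(levels, key=lambda level: PRECEDENCE[level])
```

Under this assumption, an Exec marked CO in one version and S in another resolves to CO, which is exactly the behavior questioned above.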
Thanks @nartal1!
I added some comments in the script for dependencies. My original intention was to include this script for review/future-development reference. Do we need to explicitly specify them in a file like
The jenkins job would likely have the following process:
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang for working on this new feature.
I haven't gotten the chance to review the PR yet, but I have one suggestion: it would be nice if the new Python script accepted an input to override the data pulled from the plugin. For example, we could have a config file in JSON format to describe the explicit overrides we need to do on the tools side.
Examples:
- Previously, the tools resource files manually added promote_precision as the SQL func name for PromotePrecision; that SQL func name did not exist in the plugin files. It would be nice to have the configuration file do that for us.
- In some cases, user-facing teams ask to explicitly override a certain behavior for certain customers. These can also be part of the configuration files.
- Down the road (future improvement), the config file can be an optional input. This allows changing the behavior without the need to build a custom wheel file or make a new release.
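A minimal sketch of what such a JSON override config could look like and how the script might apply it; the file shape, keys, and column names here are hypothetical, not the format the PR ends up committing:

```python
import json

# Hypothetical JSON override shape: operator name -> columns to force.
# The PromotePrecision example comes from the discussion above.
OVERRIDE_JSON = """
{
  "PromotePrecision": {"SQL Func": "promote_precision"}
}
"""

def apply_overrides(rows, overrides):
    """Patch generated CSV rows (dicts keyed by operator name) in place."""
    for row in rows:
        for col, value in overrides.get(row["Expression"], {}).items():
            row[col] = value
    return rows

rows = [{"Expression": "PromotePrecision", "SQL Func": ""}]
apply_overrides(rows, json.loads(OVERRIDE_JSON))
```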
CCing @viadea , @kuhushukla to get their thoughts because this feature would be used frequently by Hao and Kuhu.
CC @mattahrens
Thanks @cindyyuanjiang. A few minor questions.
def unify_all_files(root_dir, file_name, key_names):
    final_df = pd.DataFrame()
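For context, the union step shown above could be sketched roughly as follows with pandas; the directory layout and the de-duplication-by-key behavior are assumptions for illustration, not the script's actual implementation:

```python
import glob
import os

import pandas as pd

def unify_all_files(root_dir, file_name, key_names):
    """Union one CSV across all Spark-version subdirectories of root_dir."""
    paths = sorted(glob.glob(os.path.join(root_dir, "*", file_name)))
    frames = [pd.read_csv(p) for p in paths]
    final_df = pd.concat(frames, ignore_index=True)
    # Drop exact duplicate rows; conflicting support levels would still
    # need the precedence rule discussed elsewhere in this thread.
    return final_df.drop_duplicates(subset=key_names, ignore_index=True)
```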
Docstrings should be inside the definition for all functions
Thanks @parthosa! I will update this.
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @amahussein! I updated the Python script to accept an optional config file which overrides the final output, and also included the config file we need to use for tools.
Signed-off-by: cindyyuanjiang <[email protected]>
What does CO mean here?
CO in this context means "Configured Off", specifically for read formats. If the read format is off by default in the plugin, it's marked as CO. Here Avro and JSON are off by default, so we end up adding CO for these read formats.
The JSON and Avro formats are disabled by default.
Signed-off-by: cindyyuanjiang <[email protected]>
Thanks @cindyyuanjiang
- Can we have the generated files sorted alphabetically, similar to what the plugin sources have?
- I have some concerns about committing the changes in the CSV files:
  - It is hard to tell if the new change overrides any manual change we made in those files. For example, if I hadn't called out promote_precision, we would not see it. Is there another diff that can show us the change done to each operator? A possible workaround is to maintain the order of the columns so that the diff won't be bogus.
  - There are new rows introduced here, but we have not done any audit to assess the handling of the new operators, especially for Execs and some of the UDF expressions added in exprs.csv.
@cindyyuanjiang did you have any thoughts on how to address those issues?
CC @mattahrens
Thanks @amahussein!
I looked into the generated files in the plugin repo. They are not sorted alphabetically either. If we need to follow the order strictly, I will spend more time on it. For now the order is generally the same, with a few exceptions.
The manual changes done on the tools side are currently not tracked (we can use the config file to track them going forward). The current bogus diff is due to two new columns, DAYTIME and YEARMONTH, in supportedExecs.csv and in supportedExprs.csv.
The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:
- longer SQL func values (date_add; dateadd vs date_add)
For the new Execs and Expressions, I will file follow-up PRs to add the support.
Thanks @cindyyuanjiang
The point I was making in my previous comment is that, regardless of follow-ups, we cannot have the new CSVs merged as they are without proper evaluation.
The PR is not just about adding the script to the repo. The expectation is to make it work end-to-end with confidence that the current behavior does not break.
If we need to follow the order strictly, I will spend more time on it.
I do not understand why it would take more development time to sort the final CSV file by its first column. The script already uses pandas DataFrames.
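For illustration, sorting by the first column is indeed a one-liner once the data is in a DataFrame; final_df here is a toy stand-in for the frame the script builds:

```python
import pandas as pd

# Toy frame standing in for the script's final output.
final_df = pd.DataFrame({"Exec": ["SortExec", "FilterExec"],
                         "Supported": ["S", "S"]})

# Sort by whatever the first column happens to be.
final_df = final_df.sort_values(by=final_df.columns[0], ignore_index=True)
```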
The current bogus diff is due to the two new columns DAYTIME and YEARMONTH. If we remove these columns and compare with the old CSV files in the plugin again, we see the following.
The diff file is incorrect. I will show just one case here, but there are possibly other entries. For example, the entry below for "Empty2Null":
@@ -196,2 +186,0 @@
-Empty2Null,S, ,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
-Empty2Null,S, ,None,project,result,NA,NA,NA,NA,NA,NA,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA
However, Empty2Null exists in the tools CSV file. Also, the override_supported_config has no entry to set the SQL function empty2null:
spark-rapids-tools/core/src/main/resources/supportedExprs.csv
Lines 736 to 737 in b84d2ef
Empty2Null,S,`empty2null`,None,project,input,NA,NA,NA,NA,NA,NA,NA,NA,PS,NA,NA,NA,NA,NA,NA,NA,NA,NA
Empty2Null,S,`empty2null`,None,project,result,NA,NA,NA,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
longer SQL func values (date_add; dateadd vs date_add)
This is not 100% correct. Another reason why the diff is bogus is that the order within the SQL func column has changed on the plugin side. For example, in the tools CSV we have `=` followed by `==`, while on the plugin side we have `==` followed by `=`.
spark-rapids-tools/core/src/main/resources/supportedExprs.csv
Lines 193 to 198 in b84d2ef
EqualTo,S,`=`; `==`,None,project,lhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,rhs,S,S,S,S,S,S,S,S,PS,S,S,S,NS,NS,NS,NA,PS,NS
EqualTo,S,`=`; `==`,None,project,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
EqualTo,S,`=`; `==`,None,AST,lhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,rhs,S,S,S,S,S,NS,NS,S,PS,S,NS,NS,NS,NS,NS,NA,NS,NS
EqualTo,S,`=`; `==`,None,AST,result,S,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA
The diff may still look bogus, but if we look closely, we can see the difference is mainly due to:
That's what my initial comment was all about. What we need from this PR is to include a report of what is changing.
Obviously, diff is not the way to do it.
This PR introduces a script written in Python, which is a fully fledged framework. We can use its capabilities to report the actual difference for each record regardless of column order, and of the order within cells.
There is no need to reinvent the wheel, as a Python package such as pandas has all those functionalities.
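As a sketch of that idea, pandas' merge with an indicator can bucket rows into new, removed, and modified entries independent of row order; the function and column names below are illustrative, not part of the PR:

```python
import pandas as pd

def diff_report(old_df, new_df, keys):
    """Bucket rows into new/removed/modified, keyed on the operator columns."""
    merged = old_df.merge(new_df, on=keys, how="outer",
                          indicator=True, suffixes=("_old", "_new"))
    both = merged[merged["_merge"] == "both"]
    # Non-key columns appear twice with _old/_new suffixes after the merge.
    value_cols = [c[:-len("_old")] for c in merged.columns if c.endswith("_old")]
    changed = pd.Series(False, index=both.index)
    for col in value_cols:
        changed |= both[col + "_old"] != both[col + "_new"]
    return {
        "new_entries": merged[merged["_merge"] == "right_only"],
        "removed_entries": merged[merged["_merge"] == "left_only"],
        "modified_entries": both[changed],
    }
```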
Suggestions to move forward
What I expect from this PR is to get a CSV file that I can instantly use to replace the current tools CSV file, with confidence that the Q's behavior is identical.
It does not make much sense to split the logic of the operation between Jenkins and this script.
- Why, then, would we have this Python script in the GitHub repo?
- How do we test when the logic is split across two different places?
- What if we face a blocker getting the automated PR implemented? Then we are stuck with a non-end-to-end sync mechanism. If the logic is in one place in GitHub, we should be able to run the script manually to update the CSV files without waiting for the Jenkins job to run (given all the complexity of how the PR is going to work).
- In my comment, I described additional enums specific to changes coming from the tools side ("TON"/"TOFF").
- The script should take an argument representing the directory of the tools-generated CSV files, to be used as a reference to maintain the new changes.
- The script should generate a report with the following sections:
  - "New entries": entirely new entries. These should not be added automatically to the tools CSV file.
  - "Removed entries": (1) we define ML functions that were not defined on the plugin's side; (2) remember that the plugin side drops support for specific Spark versions, and an entry may somehow disappear from the record. On the tools side, entries stick once they are added to the system, because the tools support event logs back to Spark 2.x.
  - "Modified entries": the changes in support level done to existing entries. These are the entries whose support level has been modified compared to the existing tools file.
- The above generated report is important because a dev can manually audit it to verify that the "merge" is doing its job and that no new changes are introduced without notice.
- Optional: new entries can be added to the CSV file provided that their support level is marked as TOFF, or even TNEW (standing for "newly added entry"). Then, after proper testing, their support level can be restored. Note that this requires the script to append them automatically to the override-config files.
@nartal1 has good knowledge of the operators, and I suggest him as a good source of assistance in navigating this, since you can both meet in person.
NA = 1
NS = 2
CO = 3
PS = 4
S = 5
It is difficult to know why a given operator gets its level of support, especially since we want to override that support on demand. For instance, when Hao wants to troubleshoot an operator, he won't be able to tell whether the value comes from the plugin, was overridden by the tools, or comes from a specific version of Spark.
I suggest we define two other members:
- TOFF, standing for "tools-OFF"
- TON, standing for "tools-ON"
This will make it easy for anyone to know that an operator is turned on/off on the tools side.
- When we want to enforce disabling an operator:
  - Add an entry in the override-config file defining its support level to be "TOFF".
  - Run the script to generate a new CSV file, which will have TOFF for that operator.
- When we want to enforce enabling an operator:
  - Add an entry in the override-config file defining its support level to be "TON".
  - Run the script to generate a new CSV file, which will have TON for that operator.
This change requires that we mirror those two members on the Scala side as well.
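The proposed flow can be sketched minimally as follows; the operator names and the override-config contents are hypothetical examples, not real entries:

```python
# Hypothetical override-config contents: operator name -> tools-side marker.
# TOFF forces an operator off on the tools side; TON forces it on.
TOOLS_OVERRIDES = {"AvroScan": "TOFF", "PromotePrecision": "TON"}

def resolve_level(operator, plugin_level):
    """Prefer a tools-side override; otherwise keep the plugin's level."""
    return TOOLS_OVERRIDES.get(operator, plugin_level)
```

Because the marker itself records the provenance, anyone reading the generated CSV can tell the value was set on the tools side rather than pulled from the plugin.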
Signed-off-by: cindyyuanjiang <[email protected]>
Signed-off-by: Ahmed Hussein (amahussein) <[email protected]>
Thanks @cindyyuanjiang !
This is going to be a very useful script moving forward.
Fixes #846
The spark-rapids repo added per-Apache-Spark-version CSV file generation under spark-rapids/tools/generated_files.
This PR syncs with the plugin and updates the existing CSV files under spark-rapids-tools/core/src/main/resources. The new files are generated using the Python script included in this PR. The high-level approach is to take the union of the CSV files across all Spark versions, which is explained in more detail in the Python file.
Follow up: